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Abstract 

We introduce a few of the key ideas of statistical analysis using two real-world 
examples to illustrate how these ideas are used in practice. 

1 Introduction 

These lectures introduce to two broad classes of theories of inference, the frequentist and Bayesian 
approaches. Two points should be made immediately. The first is that there is no such thing as “the" 
answer in statistics. Instead there are answers based on assumptions on which reasonable people may 
disagree. Second, none of the current theories of inference is perfect. It is worth appreciating these 
points in order to avoid fruitless arguments that cannot be resolved because they are ultimately about 
intellectual taste and not mathematical correctness. 

For in-depth expositions of statistical analysis, we highly recommend the excellent books on statis¬ 
tics written for physicists, by physicists and the very insightful book on the history of the ideas by 

Chatterjee Q. 

2 Lecture 1: descriptive statistics, probability and likelihood 
2.1 Descriptive statistics 

Suppose we have a sample of N data X = xi, X2, ■ " It is often useful to summarize these data 
with a few numbers called statistics. A statistic is any number that can be calculated from the data 
and known parameters. For example, t = (xi -|- XAr)/2 is a statistic, but if the value of 9 is unknown 
in f = (xi — 0)^ then the latter is not a statistic. However, we particle physicists tend to refer to any 
function of the data as a statistic including those that contain unknown parameters. 

The two most important statistics are 


the sample mean (or average) 


and the sample variance 


X = 




i=l 

i=l 

1 ^ 


2 -2 
- X , 


2=1 


= x"^ — x^. 
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The sample average is a measure of the center of the distribution of the data, while the sample variance 
is a measure of its spread. Statistics that merely characterize the data are called descriptive statistics, of 
which the sample average and variance are the most important. 

Descriptive statistics can always be calculated because they depend only on a data sample X. We 
now consider numbers that cannot be calculated from the data alone. Imagine the repetition, infinitely 
many times, of the data generating system that yielded our data sample X, thereby creating an infinite 
set of data sets. We shall refer to the data generating system as an experiment and the infinite set of the 
results of the experiments as an infinite ensemble. This is clearly an abstraction. 


The most common operation to perform on an ensemble is to compute the ensemble average of 
the statistics, which yield numbers such as the following. 


Ensemble average 

(x) 

Mean 


Error 

e = X — p 

Bias 

b = (x) — p 

Variance 

V = ((x- (x))2 

Sfandard deviafion 

a = \/V 

Mean square error 

MSE = ((x - pf) 

Roof MSE 

RMS = VMSE 


None of these numbers can be calculated from data because the data needed do not objectively exist. 
Even in an experiment simulated on a computer, there are very few of these numbers we can calculate. 
If we know the mean fx, perhaps because we have chosen its value, we can certainly calculate the error e 
for any simulated datum x. But, we can only approximate the ensemble average {x), bias b, variance V, 
and MSE, since the ensembles available either on our computers or in the real world are always finite. 
The point is that the numbers that characterize the infinite ensemble are also abstractions. 

The MSE is the most widely used measure of the closeness of an ensemble of numbers to some 
parameter p. The square root of the MSE is called the root mean square (RMSQ The MSE can be 
written as 

MSE = F + 6^ (4) 

Exercise 1: Show this 


the sum of the variance and the square of the bias, a very important result with practical consequences. 
Eor example, suppose that p represents the mass of the Higgs boson and x is a complicated function that 
is considered an estimator of the mass. An estimator is any function, which when data are entered into 
it, yields an estimate of the quantity of interest. 

As noted, many of the numbers listed in Eq. Q cannot be calculated because the information 
needed is unknown. This is true, in particular, of the bias. However, sometimes it is possible to relate the 
bias to another ensemble quantity. Consider the ensemble average of the sample variance, Eq. Q, 


Exercise 2a: Show this 


Exercise 2b: Use the method Rndm() of the Root 
class TRandomS to approximate the quantities in 
Eq. Q. 


2.2 Probability 

When the weather forecast specifies fhaf fhere is a 80% chance of snow tomorrow af CERN, mosf people 
have an infuifive sense of whaf fhis means. Eikewise, mosf people have an infuifive undersfanding of 

'Sometimes, the RMS and standard deviation are using interchangeably. However, the RMS is computed with respect to 
fi, while the standard deviation is computed with respect to the ensemble average (x). The RMS and standard deviations are 
identical only if the bias is zero. 
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what it means to say that there is a 50-50 chance for a tossed coin to land heads up. Probabilistic 
ideas are thousands of years old, but, starting in the sixteenth century these ideas were formalized into 
increasingly rigorous mathematical theories of probability. In the theory formulated by Kolmogorov in 
1933, O is some fixed mathematical space, Ei,E 2 , • • • C are subsets (called events) defined in some 
reasonable wa>|^ and P{Ej) is a number associated wifh subsef Ej. These numbers satisfy fhe 

Kolmogorov Axioms 

1. P{Ej) > 0 

2. P{Ei E2 + ■ ■ ■) = P{Ei) + P{E2) + • • • for disjoinl subsefs 

3. P{n) = 1. 

Consider fwo subsefs A = Ei and B = E 2 . The quantify AB means A and B, while A + B means 
A or B, wifh associated probabilities P{AB) and P{A + B), respectively. Kolmogorov assumed, nof 
unreasonably given fhe infuifive origins of probabilify, fhaf probabilifies sum fo unify; hence fhe axiom 
P(ff) = 1. However, fhis assumption can be dropped so fhaf probabilifies remain meaningful even if 
P( 0 ) = 00 |[^. 

Figuresuggesfs anofher probabilify, namely, fhe number P{A\B) = P{AB)/P{B), called fhe 
conditional probability of A given B. This permits statements such as: “the probability that this track 
was created by an electron given the measured track parameters" or “the probability to observe 17 events 
given that the mean background is 3.8 events". Conditional probability is a very powerful idea, but the 
term itself is misleading. It implies that there are two kinds of probability: conditional and unconditional. 
In fact, all probabilities are conditional in that they always depend on a specific set of conditions, namely, 
those that define the space ft. It is entirely possible to embed a family of subsets of into another space 
17' which assigns to each family member a different probability P'. A probability is defined only relative 
to some space of possibilities 17. 

A and B are said to be mutually exclusive if 
P{AB) = 0, that is, if the truth of one denies the 
truth of the other. They are said to be exhaustive if 
P{A)+P{B) = 1. Figure [^suggests the theorem 

P{A + B) = P{A) + P{B)-P{AB), (5) 

Exercise 3: Prove theorem 

which can be deduced from the rules given 
above. Another useful theorem is an immedi¬ 
ate consequence of the commutativity of “anding" 

P{AB) = P{BA) and the definition of P{A\B), 
namely, 

Bayes Theorem 

which provides a way to convert the probability 
P{A\B) to the probability P{B\A). Using Bayes 
theorem, we can, for example, deduce the probability P{e\x) that a particle is an electron, e, given a set 
of measurements, x, from the probability P{x\e) of a set of measurements given that the particle is an 
electron. 

^If El, E 2 , - ■ ■ are meaningful subsets of 17, so to is the complement f?i, f? 2 , • • • of each, as are countable unions and 
intersections of these subsets. 



Fig. 1 : Venn diagram of the sets A, B, and AB. P{A) 
is the probability of A, while P{A\B) = P{AB)/P{B) 
is the probability of AB relative to that of B, i.e., the 
probability of A given the condition B. 
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2.2.1 Probability distributions 

In this section, we illustrate the use of these rules to derive more complicated probabilities. First we start 
with a definition: 

A Bernoulli trial, named after the Swiss mathematician Jacob Bernoulli (1654 - 1705), is 
an experiment with only two possible outcomes: S = success or F = failure. 


Example 

Each collision between protons at the Large Hadron Collider (LHC) is a Bernoulli trial in 
which something interesting happens (S) or does not (F). Let p be the probability of a 
success, which is assumed to be the same for each trial. Since S and F are exhaustive, the 
probability of a failure is 1 —p. Lor a given order O of n proton-proton collisions and exactly 
k successes, and therefore exactly n — k failures, the probability P{k, O, n,p) is given by 

P{k,0,n,p) = p^{l - p)'^~^. (7) 

If the order O of successes and failures is judged to be irrelevant, we can eliminate the order 
from the problem by summing over all possible orders, 

P{k,n,p) = '^P{k,0,n,p) = ^/(l -p)"“^. (8) 

o o 

This procedure is called marginalization. It is one of the most important operations in 
probability calculations. Every term in the sum in Eq. ^ is identical and there are (^) of 
them. This yields the binomial distribution, 

Binomial(k, n, p) = (9) 

By definition, the mean number of successes a is given by 

n 

a = A; Binomial(k, n, p), 

k=0 

= pn. (10) 

Exercise 4: Show this 


At the LHC n is a number in the trillions, while for successes of interest such as the creation 
of a Higgs boson the probability p << 1. In this case, it proves convenient to consider the 
limit p —)> 0, n —)• oo in such a way that a remains constant. In this limit 

Binomial(k, n, p) —)• e~°‘a^/k\, 

= Poisson(A:, a). (11) 

Exercise 5: Show this 


Below we list the most common probability distributions. 




Discrete distributions 

Binomial (A:, n,p) 
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Poisson(A:, a) 


exp(—a)/A:! 



K K K 


Multinomial (/c, n, p) 


Continuous densities 

Uniform(x, a) 

Gaussian(x, p, a) 

(also known as the Normal density) 
LogNormal(x, p, a) 

Chisq(x, n) 

Gamma(x, a, b) 

Exp(x, a) 

Beta(x, n, m) 


exp[—(Inx — (2(T^)]/ {xay/^) 

exp(-x/2)/[2’"/2p(^/2)] 
exp(—ax) /T{h) 
a exp(—ax) 


1/a 

exp[-(x - /i)^/(2fj^)]/(fT\/^) 



Particle physicists tend to use the term probability distribution for both discrete and continuous func¬ 
tions, such as the Poisson and Gaussian distributions, respectively. But, strictly speaking, the continuous 
functions are probability densities, not probability distributions. In order to compute a probability from 
a density we need to integrate the density over a finite set in x. 

2.3 Likelihood 

Let us assume that p{x\0) is a probability density function (pdf) such that P{A\9) = f^p(x\d) dx 
is the probability of the statement A = x ^ Rx, where x denotes possible data, 6 the parameters 
that characterize the probability model (that is the probability together with all the assumptions on 
which it is based), and Rx is a finite set. We shall use probability model as shorthand for probability 
density function (for continuous variables) or probability mass function (pmf) (basically, probabilities 
for discrete variables). If x is discrete, then both p{x\6) and P{A\6) are probabilities. The likelihood 
function is simply the probability model p{x\6) evaluated at the data xq actually obtained, i.e., the 
function p{xo\0). The following are examples of likelihoods. 


Example 1 


In 1995, CDF and D0 discovered the top quark ||^|^ at Fermilab. The D0 Collaboration 
found X = = 17 events. For a counting experiment, the datum can be modeled using 


p{x\n) = Poisson(x, n) probability to get x events 
p{N\n) = Poisson(A^, n) likelihood of N events 


= exp(—n)/A^! 


We shall analyze this example in detail in Fectures 2 and 3. 


Example 2 

Figure 2 shows a plot of the distance modulus versus redshift for N = 580 Type la super¬ 
novae |71. These heteroscedastic dat£0i9 = {zj, Xj ± cjj} are modeled using the likelihood 


N 

p(L>|0m,0a,Q) = ]^Gaussian(xi,/ri,ai), 


^Data in which each item, Xi, or group of items has a different uncertainty. 
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redshift z 

Fig. 2: Plot of the data points {zi, Xi±ai) for 580 Type la supernovae Q showing a fit of the standard cosmological 
model (with a cosmological constant) to these data (curve). 


which is an example of an un-binned likelihood. The cosmological model is encoded in the 
distance modulus function /r*, which depends on the redshift Zi and the matter density and 
cosmological constant parameters Qm and nA^ respectively. (See Ref. p0| for an accessible 
introduction to the analysis of these data.) 


Example 3 


The discovery of a Higgs boson by ATLAS o and CMS in the di-photon final state 
iPP H ^ 77) made use of an un-binned likelihood of the form, 


p(x|s, m, w, b) 

where x 
m 
w 
s 
b 

fs 

h 


N 

exp[-(s -I- b)] ni sfsixi\m,w) + bfb{xi)] 
i=l 

di-photon masses 

mass of boson 

width of resonance 

expected (i.e., mean) signal count 

expected background count 

signal probability density 

background probability density 


Exercise 6: Show that a binned multi-Poisson 
likelihood yields an un-binned likelihood of this 
form as the bin widths go to zero 


The likelihood function is arguably the most important quantity in a statistical analysis because it 
can be used to answer questions such as the following. 
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1. How do I estimate a parameter? 

2. How do I quantify its accuracy? 

3. How do I test an hypothesis? 

4. How do I quantify the significance of a result? 

Writing down the likelihood function requires: 

1. identifying all that is known, e.g., the observations, 

2. identifying all that is unknown, e.g., the parameters, 

3. constructing a probability model for both. 

Many analyses in particle physics do not use likelihood functions explicitly. However, since the data we 
use are stochastic, the failure to reflect deeply on their probabilistic nature and to model them explicitly 
leads to analyses that may not be as good as they could be. Deconstructing carefully what is being done 
in an analysis is a habit that should be encouraged so that an accurate probabilistic model of the analysis 
can be constructed. 


3 Lecture 2: the frequentist approach 

In this lecture, we consider statistical inference from the frequentist viewpoint. In lecture 3, we consider 
the Bayesian approach. In our opinion, both are needed to make sense of statistical inference, though 
this is not the dominant opinion in particle physics. 

The most important principle in the frequentist approach is that enunciated by the Polish statisti¬ 
cian Jerzy Neyman in the 1930s, namely. 


The frequentist principle 

The goal of a frequentist analysis is to construct statements so that a fraction f > pof them 
are guaranteed to be true over an infinite ensemble of statements. 


The fraction / is called the coverage probability, or coverage for short, and p is called the confidence 
level (C.L.). A procedure which satisfies the frequentist principle is said to cover. The confidence level 
as well as the coverage is a property of the ensemble of statements. Consequently, the confidence level 
may change if the ensemble changes. In a seminal paper published in 1937, Neyman 1131 invented the 
concept of the confidence interval, a way to quantify uncertainty, that respects the frequentist principle. 
The confidence interval is such an important idea that it is worth working through the concept in detail. 


3.1 Confidence intervals 

Consider an experiment that observes D events with expected (that is, mean) signal s and no background. 
Neyman devised a way to make statements of the form 

s E [1{D), u{D)], (13) 

with the a priori guarantee that at least a fraction p of them will be true over an ensemble of statements 
of this kind. A procedure for constructing such intervals is called a Neyman construction. The fre¬ 
quentist principle must hold for any ensemble of experiments, not necessarily all making the same kind 
of observations and statements. For simplicity, however, we shall consider the experiments to be of the 
same kind and to be completely specified by a single unknown parameter s. The Neyman construction is 
illustrated in Fig. 

The construction proceeds as follows. Choose a value of s and use some rule to find an interval 
in the space of observations (or, more generally, a region), for example, the interval defined by the two 
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Fig. 3: The Neyman construction. Plotted is the Cartesian product of the parameter space, with parameter s, and 
the space of observations with potential observations D. For a given value of s, the observation space is partitioned 
into three disjoint intervals, such that the probability to observe a count D within the interval demarcated by the 
two vertical lines is / > p, where p = C.L. is the desired confidence level. The inequality is needed because, for 
discrete data, it may not be possible to find an interval with f = p exactly. 


vertical lines in the center of the figure, such that the probability to obtain a count in this interval is / > p, 
where p is the desired confidence level. Then move fo anofher value of s and repeal fhe procedure. The 
procedure is repealed for a sufficienlly dense sel of poinls in fhe parameler space over a sufficienlly large 
range in s. When Ihis is done, as illuslraled in Fig. fhe intervals of probabilily conlenl / will form a 
band in fhe Cartesian producl of fhe parameler space and Ihe observalion space. The upper edge of Ibis 
band defines Ihe curve u{D), while Ihe lower edge defines fhe curve 1{D). These curves are fhe oulcome 
of fhe Neyman conslruclion. 

For a given value of s, fhe inlerval wilh probabilily conlenl / in fhe space of observations is nol 
unique since differenl rules for choosing fhe interval will, in general, yield differenl infervals. Neyman 
suggesfed choosing fhe inlerval so fhaf fhe probabilily fo oblain an observalion lo fhe righl or lefl of fhe 
inlerval are fhe same (for a given value of s), which yields fhe so-called central intervals. One virtue of 
these intervals is that their boundaries can be more efficiently calculated by solving the equations. 


P{x < D\u) = ap, 

P{x > D\l) = an, (14) 


a mathematical fact that becomes clear if we stare at Fig.|^long enough. 

Another rule was suggested by Feldman and Cousins | [I4| ]. For our example, the Feldman-Cousins 
rule requires that the potential observations {D} be ordered in descending order, D(i), D ( 2 )^ ' • •, of 
the likelihood ratio p{D\s)/p{D\s), where s is the maximum likelihood estimator (see Sec. 3.2i of the 
parameter s. Once ordered, we compute the running sum / = Ylj PiP{j)\^) / equals or just 

exceeds the desired confidence level p. This rules does not guarantee that the potential observations D 
are contiguous, but this does not matter because we simply take the minimum element of the set 
to be the lower bound of the interval and its maximum element to be the upper bound. 


Another simple rule is the mode-centered rule: order D in descending order of p(i2|s) and proceed 
as with the Feldman-Cousins rule. In principle, absent criteria for choosing a rule, there is nothing to 
prevent the use of ordering rules randomly chosen for different values of s! Figurej^compares the widths 
of the intervals [l{D),u{D)] for three different ordering rules, central, Feldman-Cousins, and mode- 
centered as a function of the count D. It is instructive to compare these widths with those provided 
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by the well-known root(N) interval, 1{D) = D — '/D and u{D) = D + y/T). Of the three sets of 
intervals, the ones suggested by Neyman are the widest, the Feldman-Cousins and mode-centered ones 
are of similar width, while the root(N) intervals are the shortest. So why are we going through all the 
trouble of the Neyman construction? We shall return to this question shortly. 

Having completed the Neyman construc¬ 
tion and found the curves u{D) and 1{D) we 
can use the latter to make statements of the form 
s G [^(-D), u{D)]: for a given observation D, 
we simply read off the interval [l{D),u{D)] from 
the curves. For example, suppose in Fig. that 
the true value of s is represented by the horizon¬ 
tal line that intersects the curves u{D) and 1{D) 
and which therefore defines the interval demar¬ 
cated by the two vertical lines. If the observa¬ 
tion D happens to fall in the interval to the left 
of the left vertical line, or to the right of the right 
vertical line, then the interval [1{D), u{D)] will 
not bracket s. However, if D falls between the 
two vertical lines, the interval [1{D), u{D)\ will 
bracket s. Moreover, by virtue of the Neyman 
construction, a fraction / of the intervals [1{D), u{D)] will bracket the value of s whatever its value 
happens to be, which brings us back to the question about the root(N) intervals. Figure shows the 
coverage probability over the parameter space of s. As expected, the three rules, Neyman’s, that of 
Feldman-Cousins, and the mode-centered, satisfy the condition coverage probability > confidence level 
over all values of s fhaf are possible a priori', fhaf is, fhe infervals cover. However, fhe rool(N) infervals 
do nol and indeed fail badly for s < 2. 


Width of Intervals 


Feldman-Cousins 


Mode-Centered 


d±Vd 



Central 

F cl dnian- C ousins 

Mode 

Root(N) 


Fig. 4: Interval widths as a function of count D for four 
sets of intervals. 




Poisson Parameter 

Central 

Feldman-Cousins 

Mode 

- Root(N) 

- 68 . 3 % 

Fig. 5: Interval widths as a function of count D for four 
sets of intervals. 

But if we knew that we would not need a theory of s 


However, the coverage probability of the 
root(N) intervals bounces around the (68%) con¬ 
fidence level for vaues of s > 2. Therefore, if we 
knew for sure fhaf s > 2, if would seem fhaf using 
fhe rool(N) intervals may nol be fhaf bad afler all. 

So whal, after all Ihis, does fhe sfalemenf 
s G [1{D), u{D)\ al 100p% C.L. mean in Ihis ap¬ 
proach, given fhaf p is a properly of fhe ensem¬ 
ble lo which Ihis sfalemenf belongs? In means 
Ihis: s G [1{D), u{D)] is a member of an en¬ 
semble of slalemenls a fracfion / > p of which 
are Irue. In principle, in order lo verify Ihis we 
need jusl counf how many slalemenls of Ihe form 
s G [1{D), u{D)] are Irue and divide by fhe lo- 
lal number of slalemenls. Unforlunalely, Ihis re¬ 
quires lhal we know which slalemenls are Irue. 
lislical inference! 


Neyman required a procedure lo cover whatever Ihe value of all Ihe parameters, be Ihey known 
or unknown, of Ihe probabilily models lhal describe Ihe dala generalion mechanisms. This is a very fall 
order, which cannol be mel in general. In praclice, we resorl lo approximalions, Ihe mosl widely used of 
which is Ihe profile likelihood lo which we now lum. 
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3.2 The profile likelihood 


As noted in Section 2.3 likelihood functions can be used to estimate the parameters on which they 
depend. The method of choice to do so, in a frequentist analysis, is called maximum likelihood, a 
method first used by Karl Frederick Gauss and developed into a formidable statistical tool in the 1930s 
by Sir Ronald A. Fisher | [T5| , perhaps the most influential statistician of the twentieth century. The D0 
top quark discovery example illustrates the method. 


Example: Top Quark Discovery Revisited 
We start by listing 

the knowns 

D = N,B where 

= 17 observed events 

B = 3.8 estimated background events with uncertainty 6B = 0.6 
and the unknowns 
b mean background count 
s mean signal count. 


Next, we construct a probability model for the data D = N,B. Since this is a counting 
experiment, we shall assume that p{x\s, b) includes a Poisson distribution with mean count 
s + b. In the absence of details about how the background B was arrived at, the standard 
assumption is that data of the form y E 6y can be modeled with a Gaussian (or normal) 
density. However, we shall do something slightly better. 

Suppose that the observed count in the control region is Q and the mean count is bk, where 
k (ideally) is the known scale factor between the control and signal regions. But, since we 
are given B and 6B rather than Q and k, we need to relate the two pairs of numbers. The 
simplest model is B = Q/k and 6B = / k from which we can infer an effective count Q 

using Q = {B/6By. Since the scale factor k is not given, we shall use the obvious estimate 
k ^ Q/B = B/SB"^. With these assumptions, our likelihood function is 


p{D\s,b) 

where 

Q 

k 


Poisson(A^, s + b) Poisson(Q, bk), 


{B/6B)‘^ = 41.11, 
B/SB"^ = 10.56. 


(15) 


The first term in Eq. (151 is the likelihood for the count = 17, while the second term is 
the likelihood for B = 3.8, or equivalently the count Q. The fact that Q is not an integer 
causes no difficulty; we merely continue the Poisson distribution to non-integer Q using 

{bk)QeM-bk)/r{Q + l). 


The maximum likelihood estimators for s and b are found by maximizing Eq. ( [15] ), that is, 
by solving the equations 


dlnp{D\s,b) _ 
ds 

ob 


leading to s = N — B, 
leading to b = B, 


as expected. 

A more complete model would account for the uncertainty in k. 
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The maximum likelihood method is the most widely used method for estimating parameters be- 
eause it generally leads to reasonable estimates. But the method has features, or eneourages praetiees, 
whieh, somewhat uneharitably, we label the good, the bad, and the ugly! 

- The Good 

- Maximum likelihood estimators are eonsistent: the RMS goes to zero as more and more data 
are ineluded in the likelihood. This is an extremely important property, whieh basieally says 
it makes sense to take more data beeause we shall get more aeeurate results. One would not 
knowingly use an ineonsistent estimator! 

- If an unbiased estimator for a parameter exists the maximum likelihood method will find it. 

- Given the MLE for s, the MLE for any funetion y = g{s) of s is, very eonveniently, just 
y = g{s). This is a very niee praetieal feature whieh makes it possible to maximize the 
likelihood using the most eonvenient parameterization of it and then transform baek to the 
parameter of interest at the end. 

- The Bad (according to some!) 

- In general, MEEs are biased. 


Exercise 7 : Show this 

Hint: Taylor expand y = g{s + h) about the MEE s, 
then eonsider its ensemble average. 

- The Ugly (according to some!) 

- The faet that most MEEs are biased eneourages the routine applieation of bias eorreetion, 
whieh ean waste data and, sometimes, yield absurdities. 

Here is an example of the seriously ugly. 


Example 

Eor a diserete probability distribution p{k), the moment generating function is the ensem¬ 
ble average 


G(a:) = 

k 


Eor the binomial, with parameters p and n, this is 

G{x) = {e^p + 1 — p)'^, Exercise 8a: Show this 
which is useful for calculating moments 

d^G 


Pr = 


dx^ 


x=0 




e.g., p 2 = + np — np^ for the binomial distribution. Given that k events out n pass a 

set of cuts, the MEE of the event selection efficiency is the obvious estimate p = k/n. The 
equally obvious estimate ofp^ is {k/n)‘^. But, 

((fc/n)^) =p^ + V/n, 


Exercise 8b: Show this 
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so (/c/n)^ is a biased estimate of with positive bias V/n. The unbiased estimate of is 


k{k — l)/[n(n — 1)], 


Exercise 8c: Show this 


which, for a single success, i.e., k = 1, yields the sensible estimate p = 1/n, but the less 
than useful = 0! 


In order to infer a value for the parameter of interest, for example, the signal s in our 2-parameter 
likelihood function in Eq. ( [T5] ), the likelihood must be reduced to one involving the parameter of interest 
only, here s, by getting rid of all the nuisance parameters, here the background parameter b. A nuisance 
parameter is any parameter that is not of current interest. In a strict frequentist calculation, this reduction 
to the parameter of interest must be done in such a way as to respect the frequentist principle: coverage 
probability > confidence level. In general, this is very difficult to do exactly. 

In practice, we replace all nuisance parameters by their conditional maximum likelihood esti¬ 
mates (CMLE). The CMEE is the maximum likelihood estimate conditional on a given value of the 
current parameter (or parameters) of interest. In the top discovery example, we construct an estimator of 
6 as a function of s, h{s), and replace b in the likelihood p{D\s, b) by b{s) to yield a function ppl{D\s) 
called the profile likelihood. 

Since the profile likelihood entails an approximation, namely, replacing unknown parame¬ 
ters by their conditional estimates, it is no longer the likelihood but rather an approximation 
to it. Consequently, the frequentist principle is not guaranteed to be satisfied exactly. 

But, if certain conditions are met (Wilks’ theorem, 1938), roughly that the MEEs do not occur on the 
boundary of the parameter space and the likelihood becomes ever more Gaussian as the data become 
more numerous — that is, in the so-called asymptotic limit, then if the true density of x is p{x\s, b) the 
random number 


f(x, s) = —2 In A(x, s), (16) 

whereA{i,,) = ":d?^, (17) 

Ppl[x\s) 

has a probability density that converges to a ^ density with one degree of freedom. More generally, 
if the numerator of A contains m free parameters the asymptotic density of f is a density with m 
degrees of freedom. Therefore, we may take t{D, s) to be a y? variate, at least approximately, and solve 
t{D, s) = V? for s to get approximate n-standard deviation confidence infervals. In parficular, if we 
solve t{D,s) = 1, we obfain approximafe 68% infervals. This calculafion is whaf Minuit, and now 
TMinuit, has done counfless limes since Ihe 1970s! Wilks’ fheorem provides fhe main juslificalion for 
using fhe profile likelihood. We again use fhe fop discovery example fo illusfrale fhe procedure. 


Example: Top Quark Discovery Revisited Again 
The condifional MEE of b is found fo be 

9 + +4(1 -h k)Qs 

2(1 +A:) 

+ Q — (1 + k^s. 


b{s) = 

where 
9 = 


(18) 
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Fig. 6: (a) Contours of the D0 top discovery likelihood and the graph of b{s). (b) Plot of — In A(17, s) versus the 
expected signal s. The vertical lines show the boundaries of the approximate 68% interval. 


The likelihood p{D\s, b) is shown in Fig. [^a) together with the graph of b{s). The mode 
(i.e. the peak) occurs at s = s = — B. By solving 


-2 In 


Ppl(17|s) 

ppp(17|17-3.8) 


= 1 


for s we get two solutions s = 9.4 and s = 17.7. Therefore, we can make the statement 
s G [9.4,17.7] at approximately 68% C.L. Figurej^b) shows a plot of — In A(17, s) created 
using the RooFit and RooStats packages. 

Exercise 9: Verify this interval using the RooFit/RooStats package 

Intervals constructed this way are not guaranteed to satisfy the frequentist principle. In 
practice, however, their coverage is very good for the typical probability models used in 
particle physics, even for modest amounts of data. 


3.3 Hypothesis tests 


It is hardly possible in experimental particle physics to avoid testing hypotheses, testing that invariably 
leads to decisions. For example, electron identification entails hypothesis testing; given data D we ask: 
is this particle an isolated electron or is it not an isolated electron? In the discovery of the Higgs boson, 
we had to test whether, given the data available in early summer 2012, the Standard Model without a 
Higgs boson, a somewhat ill-founded background-only model, or the Standard Model with the boson 
of July 2012, the background -|- signal model, was the preferred hypothesis. We decided that the latter 
model was preferred and announced the discovery of a new boson. Given the ubiquity of hypothesis 
testing, it is important to have a grasp of the methods that have been invented to implement it. 


One method was due to Fisher |15|, another was invented by Neyman, and a third (Bayesian) 
method was proposed by Sir Harold Jeffreys p8| , all around the same time. We first describe the method 
of Fisher, then follow with a description of the method of Neyman. For concreteness, we consider the 
problem of deciding between a background-only model and a background -|- signal model. 
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3.3.1 Fisher’s approach 

In Fisher’s approach, we construct a null hypoth¬ 
esis, often denoted by Hq, and reject it should 
some measure be judged small enough to cast 
doubt on the validity of this hypothesis. In our 
example, the null hypothesis is the background- 
only model, for example, the SM without a Higgs 
boson. The measure is called a p-value and is de¬ 
fined by 

p-value(xo) = P{x > xo\Hq), (19) 

where x is a statistic designed so that large values 
indicate departure from the null hypothesis. This 
is illustrated in Fig. which shows the location 
of the observed value xq of x. The p-value is the probability that x could have been higher than the 
X actually observed. It is argued that a small p-value implies that either the null hypothesis is false 
or something rare has occurred. If the p-value is extremely small, say ~ 3 x 10“^, then of the two 
possibilities the most common response is to presume the null to be false. If we apply this method to the 
D0 top quark discovery data, and neglect the uncertainty in the null hypothesis, we find 

OO 

p-value = ^ Poisson(A^, 3.8) = 5.7 x 10“^. 

N=n 

We usually report a more intuitive number by converting the p-value to the scale defined by 

Z =-v/2erf~^(l — 2p-value). (20) 

This is fhe number of Gaussian standard deviations away from the meaij^ A p-value of 5.7 x 10“^ 
corresponds to a Z of 4.9cr. The Z-value can be calculated using the Root function 

Z = -TMath::NorniQuantile(p-value). 



Fig. 7: The p-value is the tail-probability, P{x > 
Xo\Hq), calculated from the probability density under 
the null hypothesis, Hq. Consequently, the probabil¬ 
ity density of the p-value under the null hypothesis is 
Uniform(a;, 1). 


3.3.2 Neyman’s approach 

In Neyman’s approach two hypotheses are considered, the null hypothesis Hq and an alternative hypoth¬ 
esis Hi- This is illustrated in Fig. In our example, the null is the same as before but the alternative 
hypothesis is the SM with a Higgs boson. Again, one generally chooses x so that large values would cast 
doubt on the validity of Hq. However, the Neyman test is specifically designed to respect the frequentist 
principle, which is done as follows. A fixed probability a is chosen called the significance (or size) of 
the test, which for a specific class of experiments corresponds to some threshold x^ defined by 

a = P{x > Xa\HQ). (21) 

Should the observed value xq > Xa, or equivalently, p-value(xo) < a, the hypothesis Hq is rejected 
in favor of the alternative. In particle physics, in addition to applying the Neyman hypothesis test, we 
also report the p-value. This is sensible because there is a more information in the p-value than merely 
reporting the fact that a null hypothesis was rejected at a significance level of a. 

"'erf(a;) = exp(—t^) dt is the error funtion and erf~^(a:) is its inverse. 
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The Neyman method satisfies the frequen- 
tist principle by construction. Since the signifi¬ 
cance of fhe fesf is fixed, a is fhe relative fre¬ 
quency wifh which true null hypofheses would be 
rejected and is called fhe Type I error rate. 

However, since we have specified an alfer- 
nafive hypofhesis fhere is more fhaf can be said. 

Figureshows fhaf we can also calculate 

j3 = P{x < Xa\Hi), (22) 

which is fhe relafive frequency wifh which we 
would rejecf fhe hypofheses of fhe form of Hi if 
they are true. These mistakes are called Type II 
errors. The quantity 1 — /3 is called the power of 
the test and is the relative frequency with which we would accept Hi if true. Obviously, for a given a we 
want to maximize the power. Indeed, this is the basis of the Neyman-Pearson lemma (see for example 
Ref. Q), which asserts that given two simple hypotheses — that is, hypotheses in which all parameters 
have well-defined values — fhe opfimal sfafisfic t fo use in fhe hypofhesis fesf is fhe likelihood rafio 
t = p{x\Hi)/p{x\Hq). 

Maximizing fhe power seems sensible. 
Consider Fig.|^ The significance of fhe fesf in fhis 
figure is fhe same as fhaf in Fig. so fhe Type I 
error rate is idenfical. However, fhe Type II error 
rate is much greater in Fig.j^fhan in Fig.[^ fhaf is, 
fhe power of fhe fesf is considerably weaker in fhe 
former. In fhaf case, fhere may be no compelling 
reason fo rejecf fhe null since fhe alfernafive is nof 
fhaf much belter. This insighl was one source of 
Neyman’s disagreemenl wifh Fisher. Neyman ob- 
jecfed fo fhe possibilify fhaf one mighf rejecf a null 
hypofhesis regardless of whefher if made sense fo 
do so. He insisted fhaf fhe fask is always one of 
deciding befween competing hypofheses. Fisher’s 
counler argumenl was fhaf an alfernafive hypofh¬ 
esis may nof be available, buf we may nonefheless wish fo know whefher fhe only hypofhesis fhaf is 
available is worfh keeping. As we shall see, fhe Bayesian approach also requires an alfernafive, in agree- 
menl wifh Neyman, buf in a way fhaf neilher he nor Fisher agreed wifh! 

We have assumed fhaf fhe hypofheses Hq and Hi are simple, fhaf is, fully specified. Unforfu- 
nalely, mosl of fhe hypofheses fhaf arise in realislic parlicle physics analyses are nof of fhis kind. In fhe 
Higgs boson discovery analyses by ATLAS and CMS fhe probabilify models depend on many nuisance 
paramefers for which only estimates are available. Consequenfly, neilher fhe background-only nor fhe 
background -|- signal hypofheses are fully specified. Such hypofheses are called compound hypotheses. 
In order fo illusfrafe how hypofhesis lesling proceeds in fhis case, we again lum again fo fhe lop discovery 
example. 

Example 

As we saw in Sec. |3.2[ fhe slandard way fo handle nuisance paramefers in fhe frequenlisl ap¬ 
proach is fo replace Ihem by Iheir condifional MLEs and Ihereby reduce fhe likelihood func¬ 
tion fo fhe profile likelihood. In fhe lop discovery example, we obfain a function ppl{D\s) 



Fig. 9: See Fig. |^for details. Unlike the case in Fig. 
the two hypotheses Hq and Hi are not that different. It is 
then not clear whether it makes practical sense to reject 
Hq when x > Xa only to replace it with an hypothesis 
Hi that is not much better. 



Fig. 8: Distribution of a test statistic x for two hypothe¬ 
ses, the null Hq and the alternative Hi. In Neyman’s 
approach to testing, a = P{x > XqIHq) is ^fixed proba¬ 
bility called the significance of the test, which for a given 
class of experiments corresponds the threshold Xa- The 
hypothesis Hq is rejected if x > Xa- 
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that depends on the single parameter, s. We now treat this function as if it were a likelihood 
and appeal to both the Neyman-Pearson lemma, which suggests the use of likelihood ratios, 
and Wilks’ theorem to motivate the use of the function t{x, s) given in Eq. ( [Tt] ) to distinguish 
between two hypotheses: the hypothesis Hi in which s = s = N — B and the hypothesis 
Ho in which s ^ s, for example, the background-only hypothesis s = 0. In the context of 
testing, t{x, s) is called a test statistic, which, unlike a statistic as we have defined it (see 
Sec.|2.1|), usually depends on at least one unknown parameter. 


In principle, the next step is the computationally arduous task of simulating the distribution 
of the statistic t{x, s). The task is arduous because a priori the probability density p{t\s, b) 
can depend on all the parameters that exist in the original likelihood. If this is the case, then 
after all this effort we seem to have achieved a pyrrhic victory! But, this is where Wilks’ 
theorem saves the day, at least approximately. We can avoid the burden of simulating t(x, s) 
because the latter is approximately a variate. 

Using N = n and s = 0, we find ^/to = = 4.6. According to the results shown 

in Fig. Q(a), = 17 may can be considered “a lot of data"; therefore, we may use to 

to implement a hypothesis test by comparing to with a fixed value ta corresponding to the 
significance level a of the test. 


4 Lecture 3: the Bayesian approach 


In this lecture, we introduce the Bayesian approach to inference, again using the top quark discovery data 
from D0 to illustrate the ideas. 


The Bayesian approach is merely applied probability theory (see Section 2.2 1 . 
Bayesian if 


A method is 


- it is based on the degree of belief interpretation of probability and 

- it uses Bayes theorem 


p(b,ujlL>) 

where 

D 

6 

LO 

p{9,uj\D) 
7r{9, uj) 


p{D\9, u) 7r{9, io) 

’ 

observed data, 
parameters of interest, 
nuisance parameters, 
posterior density, 
prior density (or prior for short). 


(23) 


for all inferences. The result of a Bayesian inference is the posterior density p{9,io\D from which, if 
desired, various summaries can be extracted. The parameters can be discrete or continuous and nuisance 
parameters are eliminated by marginalization. 


p{9\D) = J p{9,uj\D) du, 

oc / p{D\9,uj) Tr{9,uj) dix. 


(24) 


The function 7r{9,uj), called the prior, encodes whatever information we have about the parameters 9 
and io independently of the data D. A key feature of the Bayesian approach is recursion: the use of the 
posterior density p{9, u]\D) or one, or more, of its marginals such as p{9\D) as the prior in a subsequent 
analysis. 
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These simple rules yield an extremely powerful and general inference model, a model that was 
used, for example, in the discovery of single top quark production at the Tevatron |T9|20|. 


4.1 Model selection 

Conceptually, hypothesis testing in the Bayesian approach (also called model selection) proceeds exactly 
the same way as any other Bayesian calculation: we compute the posterior density. 


p{9,u;,H\D) 


p{D\e,uj,H)Tr{e,io,H) 

W) 


(25) 


and marginalize it with respect to all parameters except the ones that label the hypotheses or models, H, 


p{H\D)= p{e,uj,H\D)deduj. 


(26) 


Equation (26 1 is the probability of hypothesis H given the observed data D. In principle, the parameters 
(jj could also depend on H. For example, suppose that H labels different parton distribution function 
(PDF) models, say CTIO, MSTW, and NNPDF, then oj would indeed depend on the PDF model and 
should be written as coh- 

Fike a Ph.D., it is usually more convenient to arrive at the probability p{H\D) in stages. 


1. Factorize the prior in the most convenient form, 

7 r( 6 ', UJH, H) = TT{e, u:h\H) vr(iT), 

= Ti{e\uH,H)Tx{u:H\H)^{H), (27) 

or 

= TT{ujH\9,H)TT{e\H)TT{H). (28) 

Often, we can assume that the parameters of interest 9 are independent, a priori, of both the 
nuisance parameters ujh and the model label H, in which case we can write, tt{9,u!h, H) = 
7r(0) 7r{u!H\H) 

2. Then, for each hypothesis, H, compute the function 

p{D\H) = j p{D\e,ujH,H)7r{e,uj\H)d9du;. (29) 

3. Then, compute the probability of each hypothesis. 


p{H\D) 


p{D\H) 7 t{H) 


(30) 


Clearly, in order to compute p{H\D) it is necessary to specify the priors Tr{9,io\H) and vr(77). With 
some effort, it is possible to arrive at an acceptable form for tt{9,uj\H), however, it is highly unlikely 
that consensus could ever be reached on the discrete prior At best, one may be able to adopt a 

convention. For example, if by convention two hypotheses Hq and Hi are to be regarded as equally 
likely, a priori, then it would make sense to assign = tt{Hi) = 0.5. 

One way to circumvent the specification of the prior tt{H) is to compare the probabilities. 


p{Hi\D) _ \ p{D\Hi) 
p{Ho\D) [p{D\Ho 

and use only the term in brackets, called the global Bayes factor, Biq, as a way to compare hypotheses. 
The Bayes factor specifies by how much the relative probabilities of two hypotheses changes as a result of 


7t{Hi) 
7t{Ho) ■ 


(31) 
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incorporating new data, D. The word global indicates that we have marginalized over all the parameters 
of the two models. The local Bayes factor, Biq{6) is defined by 


Bm{e) 

where, 


p{D\0:H^) 
p{D\Ho) ’ 


p{D\e,H^) 


j p{D\9,ujHi,Hi)Tr{ujHi\Hi)du}Hi, 


(32) 


(33) 


are the marginal or integrated likelihoods in which we have assumed the a priori independence of 9 
and loh^ . We have further assumed that the marginal likelihood Hq is independent of 9, which is a very 
common situation. For example, 9 could be the expected signal count s, while lohi = ^ could be the 
expected background b. In this case, the hypothesis Hq is a special case of Hi, namely, it is the same as 
Hi with s = 0. An hypothesis that is a special case of another is said to be nested in the more general 
hypothesis. The example, discussed below, will make this clearer. 

There is a notational subtlety that may be missed: because of the way we have defined p{D\9, H), 
we need fo mulfiply p{D\9,H) hy fhe prior 7r(0) and fhen infegrafe wifh respecf fo 9 in order fo calculafe 
p{D\H). 


4.1.1 Priors 


Consfrucfing a prior for nuisance paramefers is generally neifher confroversial (for mosf parameters) 
nor problemafic. The Achilles heal of fhe Bayesian approach is fhe need fo specify fhe prior tt{9), 
for fhe paramefers of inferesf, af fhe sfarf of fhe inference chain when we know almosf nofhing abouf 
fhem. Careless specificalion of Ibis prior can yield resulfs fhaf are unreliable or even nonsensical. The 
mandatory requiremenf is fhaf fhe posterior densify be proper, fhaf is infegrafe fo unify. Ideally, fhe 
same should hold for priors. A very extensive liferafure exisfs on fhe topic of prior specificalion when 
fhe available information is exfremely limited. However, a discussion of fhis topic is beyond fhe scope 
of fhese lecfures. 


For model selection, we need fo proceed wifh caufion because Bayes faclor are sensifive fo fhe 
choice of priors and Iherefore less robusf fhan posterior densities. Suppose fhaf fhe prior 7r(6*) = Cf{9), 
where C* is a normalization consfanf. The global Bayes faclor for fhe Iwo hypolheses Hi and Hq can be 
wrillen as 


.Bio 


fpiD\9,Hi)f{9)d9 

p{D\Hq) 


(34) 


Therefore, if the constant C is ill defined, fypically because f f{9) d9 = oo, fhe Bayes faclor will 
likewise be ill defined. For fhis reason, if is generally recommended fhaf an improper prior nol be used 
for paramefers 9 fhaf occur only in one hypolhesis, here Hi . However, for paramefers fhaf are common fo 
all hypolheses, if is permissible fo use improper priors because fhe conslanls cancel in fhe Bayes faclor. 

The discussion so far has been somewhaf abslracl. The nexl section Iherefore works Ihrough an 
example of a possible Bayesian analysis of fhe D0 lop discovery dafa. 


4.2 The top quark discovery: a Bayesian analysis 

In this section, we shall perform the following calculations as a way to illustrate a typical Bayesian 
analysis, 

1. compute the posterior density p{s\D), 

2. compute a 68% credible interval [l{D),u{D)], and 

3. compute the global Bayes factor Biq = p{D\Hi)/p{D\Ho). 
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Probability model 

The first step in any serious statistical analysis is to think deeply about what has been done in the physics 
analysis and construct a probability model. The full probability model is the joint probability, 

p{x,s,h\I), 


which, as is true of all probability models, is conditional on the information and assumptions, I, that 
define the abstract space (see Sec. 2.2 1 . In these lectures, we have omitted the conditioning data I, but 
it should not be forgotten that it is always present and may differ from one probability model to another. 

The full probability model p{x, s, b) can be factorized in several mathematically valid ways. How¬ 
ever, we find it convenient to factorize the model in the following way. 


p{x, s, b) = p{x\s, b) 7r(s, b), 


(35) 


where we have introduced the symbol tt in order to highlight the distinction we choose to make between 
this part of the model and the remainder. We shall compute the likelihood from p{x\s, b) and view 7r(s, b) 
as the prior for s and b. We assume p{x\s, b) to be 

p{x\s, b) = Poisson(a:, s + b). (36) 

The interpretation of p{x\s, b) is clear: it is the probability to observe x events given that the mean event 
count is s + 6. What does 7r(s, 6) represent? This prior encodes what we know, or assume, about the mean 
background and signal independently of the potential observations x. The prior 7r(s, b) can be factored 
in two ways, 

7r(s, b) = 7r(s|6) vr(6), 

= 7r(6|s) 7r(s). (37) 

The factorizations remind us that the parameters s and b may not be independent. However, we shall 
assume that they are, at least at this stage of the analysis, in which case it is permissible to write, 

7r(s, 6) = 7r(s) 7r(6). (38) 


What do we know about the background? We know the count Q in the control region and we have 
an estimate of the control region to signal region scale factor k. Since Q is a count, a reasonable model 
for the likelihood is 


p{Q\k, b) = Poisson(Q, kb), (39) 

from which, together with a prior TT{k, b), we can compute the posterior density 

p{b\Q, k) = p{Q\k, b) Tr{k, b)/p{Q, k). (40) 

As usual, we factorize the prior, 'K{k, b) = 7r(fc|6)7ro(6), where we have introduced the subscript 0 to 
distinguish 7ro(&) from the background prior associated with Eq. (361. But, now we need to construct 
TT{k\b) and 7ro(6) using whatever information we have at hand. 

Clearly, b > 0. But, that miserable tidbit is all we know apart from the background likelihood, 
Eq. ( |39| ) ! Today, after a century of argument and discussion, the consensus amongst statisticians is that 
there is no unique way to represent such vague information. However, well founded ways to construct 
such priors are available, see for example Ref. |211 and references therein, but for simplicity we take the 
prior TTo{b) = 1, that is, the flat prior. If the uncertainty in k can be neglected, the (proper!) prior for k 
is 7r(A:|6) = d{k — B/6B‘^), which amounts to replacing k in Eq. (40l by B/SB"^. This yields. 


r, — kb 


p(j3\Q, k) = Gamma(/i:6, 1,Q + 1) = 


{kb)Q 


r(Q + i) ’ 


(41) 


for the posterior density of b, which can serve as the prior 7r(6) associated with Eq. (361. 
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By construction, p(x, s, b) is identical in 
form to the likelihood in Eq. (0; we have sim¬ 
ply availed ourselves of the freedom to factor¬ 
ize p(x, s, 6) as we wish and therefore to reinter¬ 
pret the factors. This freedom is useful because 
it makes it possible to keep the likelihood simple 
while relegating the complexity to the prior. This 
may not seem, at first, to be terribly helpful; af¬ 
ter all, we arrived at the same mathematical form 
as Eq. However, the complexity can be sub¬ 
stantially mitigated by sampling from the prior so 
that the model is represented by the relatively sim¬ 
ple likelihood and an ensemble of points that col¬ 
lectively represent the prior. The likelihood, as we 
have conceptualized the problem, is given by 



p{D\s,b) 


Idi 


where D = 17 events. 


(42) 

" ' Fig. 10: Posterior density computed for D0 top quark 

discovery data. The shaded area is the 68% central cred¬ 
ible interval. 


The final ingredienf is fhe prior vr(s). Af 
this stage, all we know is that s > 0 and, again, 

there is no unique way to specify 7r(s), though as noted there are well founded methods to construct it. 
We shall assume either the improper prior 7r(s) = 1 or the proper prior 7r(s) = d{s — 14). 


Marginal likelihood 

We have done the hard part: building the full probability model. Hereafter, the rest of the Bayesian 
analysis is mere computation. 

It is convenient to eliminate the nuisance parameter b, 


p{D\s,Hi)= / p{D\s,b)7T{b)d{kb), 

Jo 

1 ^ 

— (1 — x)^ ^ Beta(a:, r -|- 1, Q) Poisson(A^ 


Q 

where x = 1/{1 + k), 


-r\s), 


r=0 


(43) 


Exercise 10: Show this 

and thereby arrive at the marginal likelihood p(i2|s, Tfi). 

Posterior density 

Given the marginal likelihood p{D\s, Hi) and a prior 7r(s) we can compute the posterior density, 

pis\D, Hi) = p{D\s, Hi) Tr{s)/p{D\Hi), (44) 

where, 

POO 

p{D\Hi)= p{D\s,Hi)TT{s)ds. 

Jo 
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Assuming a flat prior for the signal, 7r(s) = l,we find 


p{s\D,H^) 


Beta(x, r + 1,Q) Poisson(A^ — r|s) 


(45) 


Exercise 11: Derive an expression for p(s|Z),ffi) assuming 
7r(s) = Gamma(gs, 1, M + 1) where q and M are constants 


from which we can compute the central credible interval [9.9,18.4] for s at 68% C.L., which is shown 
in Fig. 10 The statement s G [9.9,18.4] at 68% C.L. means there is a 68% probability that s lies in 


[9.9,18.4]. Unlike the frequentist statement, this statement is about this particular interval and the 68% 
is a degree of belief, not a relative frequency. That being said, the best Bayesian methods tend to produce 
credible intervals that also approximate confidence intervals. 


4.2.1 Bayes factor 

As noted, the number p{D\Hi) can be used to perform a hypothesis test. But, as argued above, we need 
to use a proper prior for the signal, that is, a prior that integrates to one. The simplest such prior is a 
5-function, e.g., 7r(s) = 5(s — 14). Using this prior, we find 

p{D\Hi) =p{D\U,Hi) = 9.28 x 10“2. 

Since the background-only hypothesis Hq is nested in Hi, and defined by s = 0, the number p{D\Ho) 
is given by p{D\0, Hi), which yields 

p{D\Ho) = p{D\0, Hi) = 3.86 x 10“®. 

We conclude that the hypothesis s = 14 is favored over s = 0 by a Bayes factor of 24,000. In order 
to avoid large numbers, the Bayes factor can be mapped into a (signed) measure akin to the frequentist 
“n-sigma" p2| , 

Z = sign (In Bio) y/2| In Bio\, (46) 

which gives Z = 4.5. Negative values of Z correspond to hypotheses that are excluded. 

Summary 

We have given an overview of the main ideas of statistical inference in a form directly applicable to sta¬ 
tistical analysis in particle physics. Two widely used approaches were covered, frequentist and Bayesian. 
Statistics is not physics. While Nature is the ultimate arbiter of which physics ideas are “correct", the 
ultimate arbiter of statistical ideas is intellectual taste. Therefore, we hope you take to heart the following 
advice. 


“Have the courage to you use your own understanding" 
Immanuel Kant 
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